Where v is the testing instance, ci is the target attribute value, aj is a data attribute,
and vj is its value (Shouman, Turner & Stockers, 2012). Many existing efficient clustering algorithms are hard to implement inside an RDBMS, where the programmer needs to worry about storage management, concurrent access, memory leaks, fault tolerance, security, and so on. Using SQL has traditionally not been considered an efficient or feasible way to implement data mining algorithms, since the language was not designed with data mining, machine learning, or statistical computation in mind. During my research I have found that it is possible to obtain an efficient SQL implementation of the popular K-means clustering algorithm that works on top of a relational DBMS. As I mentioned earlier, research by Carlos Ordonez approaches the problem from a performance point of view, explaining how to cluster large data sets by defining and indexing tables to store and retrieve intermediate and final results, optimizing and avoiding joins, improving and simplifying clustering aggregations, and taking advantage of sufficient statistics. The basic input for K-means is a
dataset Y containing n points in d dimensions, Y = {y1, y2,..., yn}, and k, the desired
number of clusters. The output is three matrices W, C, R, containing k weights, k
means and k variances respectively corresponding to each cluster and a partition of Y
into k subsets. Matrices C and R are d x k and W is k x 1. Throughout this work three
subscripts are used to index matrices: i = 1,..., n, j = 1,..., k, l = 1,...,d. To refer to one
column of C or R we use the j subscript (e.g. Cj, Rj); Cj can be understood as a d-dimensional vector containing the centroid of the jth cluster, with the respective squared radii per dimension given by Rj. For transposition we use the T superscript. For instance, Cj refers to the jth centroid in column form and CTj is the jth
centroid in row form. Assume the clustering partitions the dataset Y into k disjoint subsets X1, X2, ..., Xk, with no overlap between them. K-means works by measuring how close each data point yi is to the centroid Cj of a cluster, using the Euclidean distance, that is, the straight-line distance between two points. The formula for this distance is:
d(yi, Cj) = √[(yi1 - Cj1)² + (yi2 - Cj2)² + ... + (yid - Cjd)²]
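As a sketch (not from the source), this distance can be computed in Python; the function name and the plain-list representation of points are my own assumptions:

```python
import math

def euclidean_distance(y, c):
    """Straight-line distance between a point y and a centroid c,
    both given as length-d lists of coordinates."""
    # Sum the squared per-dimension differences, then take the root.
    return math.sqrt(sum((yl - cl) ** 2 for yl, cl in zip(y, c)))

# A 3-4-5 right triangle: the distance from (0, 0) to (3, 4) is 5.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # prints 5.0
```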
In other words, it adds up the squared differences between each feature of the point and the centroid, then takes the square root (Ordonez, 2004). Centroids Cj are generally initialized with k points randomly selected from Y, which provides a starting approximation of the potential clusters. Starting from this initial solution, the algorithm iterates, executing the E and M steps until the cluster centroids become stable. The E step determines the nearest cluster for each point and adds the point to it. That is, the E step determines cluster membership and partitions Y into k subsets.
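A minimal Python sketch of the E step, assuming points and centroids are plain lists of coordinates (the function name and data layout are my own, not Ordonez's):

```python
def e_step(Y, C):
    """Assign each point in Y to its nearest centroid in C.
    Returns a list of k subsets (lists of points), i.e. the partition of Y."""
    subsets = [[] for _ in C]
    for y in Y:
        # Compare squared distances; the square root is monotonic,
        # so it can be skipped when only ranking distances.
        d2 = [sum((yl - cl) ** 2 for yl, cl in zip(y, c)) for c in C]
        subsets[d2.index(min(d2))].append(y)
    return subsets
```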
The M step updates all centroids Cj by averaging points belonging to the same cluster.
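The centroid update of the M step can be sketched in the same style (keeping the previous centroid for an empty cluster is a policy choice of mine, not something the source specifies):

```python
def m_step(subsets, C_old):
    """Recompute each centroid as the mean of the points assigned to its
    cluster; empty clusters keep their previous centroid."""
    C_new = []
    for X, c_old in zip(subsets, C_old):
        if X:
            d = len(X[0])
            # Average the points of this cluster dimension by dimension.
            C_new.append([sum(y[l] for y in X) / len(X) for l in range(d)])
        else:
            C_new.append(c_old)
    return C_new
```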
Then the k cluster weights Wj and the k diagonal variance matrices Rj are updated
based on the new Cj centroids. The quality of a clustering solution is measured by the
average quantization error q(C). K-means tries to make each cluster as tight and compact as possible: it searches for the set of centroids (cluster centers) that minimizes the average squared distance between each data point yi and its assigned centroid Cj, and that average is the value q(C) it drives down.
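A sketch of computing q(C) in Python, assuming each point's assigned centroid is its nearest one (the function name and representation are illustrative, not from the source):

```python
def quantization_error(Y, C):
    """Average squared distance from each point in Y to its nearest
    centroid in C, i.e. the cluster it is assigned to."""
    total = 0.0
    for y in Y:
        # Squared distance to the nearest centroid.
        total += min(sum((yl - cl) ** 2 for yl, cl in zip(y, c)) for c in C)
    return total / len(Y)
```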
This quantity measures the average squared distance from each point to the cluster
where it was assigned according to the partition into k subsets. K-means finishes when q(C), the value it is trying to minimize, changes by less than a small threshold between iterations, meaning the clusters have stabilized. K-means is theoretically guaranteed to converge, decreasing q(C) at each iteration, but it is common to set a maximum number of iterations to avoid long runs (Ordonez, 2004). There are two main schemes presented in Ordonez's research. The first one presents a simple implementation of K-